3 research outputs found

Decoding billions of integers per second through vectorization

Author: Aksyonoff A
Büttcher S
Jones DM
Witten IH
Publication venue: 'Wiley'
Publication date: 01/01/2015

In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.

Comment: For software, see https://github.com/lemire/FastPFor; for data, see http://boytsov.info/datasets/clueweb09gap
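As a rough illustration of the kind of integer-array compression the abstract refers to, the sketch below shows plain scalar binary packing: every integer in a block is stored with the same bit width, chosen as the width of the largest value. This is not SIMD-BP128 itself and uses no SIMD instructions; the function names `pack_block` and `unpack_block` are hypothetical and do not come from the FastPFor library.

```python
from typing import List, Tuple


def pack_block(values: List[int]) -> Tuple[int, bytes]:
    """Pack non-negative integers using one shared bit width.

    Returns (bit_width, packed_bytes). Hypothetical helper for illustration.
    """
    if any(v < 0 for v in values):
        raise ValueError("only non-negative integers are supported")
    # The widest value in the block decides how many bits every value gets.
    bit_width = max((v.bit_length() for v in values), default=0)
    acc = 0
    for i, v in enumerate(values):
        # Place value i at bit offset i * bit_width.
        acc |= v << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return bit_width, acc.to_bytes(n_bytes, "little")


def unpack_block(bit_width: int, packed: bytes, count: int) -> List[int]:
    """Recover `count` integers that were packed with `bit_width` bits each."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


# Example: eight small gaps between sorted document IDs (made-up values).
gaps = [3, 1, 7, 2, 5, 4, 6, 0]
width, blob = pack_block(gaps)
restored = unpack_block(width, blob, len(gaps))
```

Here the largest value is 7, so each of the eight integers takes 3 bits and the block fits in 3 bytes instead of 32 bytes as 32-bit integers. Vectorized schemes apply the same idea to many integers per instruction.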

arXiv.org e-Print Archive

R-libre

Crossref

Fine-grained document clustering via ranking and its application to social media analytics

Author: A Aksyonoff
A Fahad
A Spink
AK Jain
C Rijsbergen Van
C Wang
CD Manning
DM Blei
F Pedregosa
H Chen
H Hu
H-P Kriegel
J Chen
J Shi
JD Hunter
JH Ward Jr
M Ester
M Gellman
M Widenius
N Fuhr
N Jardine
N Tomašev
O Zamir
RM Losee
S Papadopoulos
S Trepte
SE Robertson
SP Lloyd
W He
W Medhat
WB Johnson
X Hu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Extracting valuable insights from a large volume of unstructured data such as texts through clustering analysis is paramount to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a data set that represents heterogeneous topics such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach, which leverages the search engine capability of handling big data efficiently. Ranking scores from a search engine are used to calculate dynamic cluster representations called loci in an unsupervised learning setting. Clustering decisions are efficiently made based on an optimal selection from a small subset of loci instead of the entire cluster set as in conventional centroid-based clustering. A comprehensive empirical study on several social media data sets shows that FGCR is able to produce insightful and accurate fine-grained solutions. Moreover, it is orders of magnitude faster and requires fewer computational resources than other state-of-the-art document clustering approaches.

Crossref

Queensland University of Technology ePrints Archive

Fine-grained document clustering via ranking and its application to social media analytics

Author: A Aksyonoff
A Fahad
A Spink
AK Jain
C Rijsbergen Van
C Wang
CD Manning
DM Blei
F Pedregosa
H Chen
H Hu
H-P Kriegel
J Chen
J Shi
JD Hunter
JH Ward Jr
M Ester
M Gellman
M Widenius
N Fuhr
N Jardine
N Tomašev
O Zamir
RM Losee
S Papadopoulos
S Trepte
SE Robertson
SP Lloyd
W He
W Medhat
WB Johnson
X Hu
Publication venue: 'Springer Science and Business Media LLC'

Crossref